Welcome to an EDA of the McDonald’s India Menu - Nutrition dataset, uploaded by Deep Contractor.
You don’t have to be a health expert to know fast food isn’t good for you. But how bad is it, really? Can we find patterns in the design of their menu? Let’s explore.
Read through the initial setup in the 4 tabs below.
First, I import some useful libraries and set some plotting defaults.
# Data Manipulation
library(dplyr)
library(tidyr)
library(readr)
library(skimr)
library(purrr)
library(stringr)
library(urltools)
library(magrittr)
# Plots
library(ggplot2)
library(naniar)
library(packcircles)
library(ggridges)
library(ggbeeswarm)
library(patchwork)
# PCA
if(!require(FactoMineR))
remotes::install_cran("FactoMineR")
if(!require(factoextra))
remotes::install_cran("factoextra")
library(FactoMineR)
library(factoextra)
# Tables
library(reactable)
# Settings
theme_set(theme_minimal(
base_size = 13,
base_family = "Menlo"))
theme_update(
plot.title.position = "plot"
)

Let’s start by reading in the data. There is only one CSV file, with
the menu items and some measurements on each item.
{janitor::clean_names} helps us get clean column names
quickly.
dat <- read_csv("../input/mcdonalds-india-menu-nutrition-facts/India_Menu.csv") %>%
janitor::clean_names()
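As a toy illustration of what clean_names() is doing here (the messy headers below are hypothetical, modeled on what the raw CSV likely contains):

```r
library(janitor)

# Hypothetical raw headers, similar in spirit to the CSV's originals
messy <- data.frame(
  check.names = FALSE,
  `Energy (kCal)` = 402.05,
  `Per Serve Size` = "168 g"
)
# yields c("energy_k_cal", "per_serve_size")
names(clean_names(messy))
```

Note how "kCal" is split at the case change, which is exactly where the slightly odd `energy_k_cal` column name comes from.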
glimpse(dat, 100)

## Rows: 141
## Columns: 13
## $ menu_category <chr> "Regular Menu", "Regular Menu", "Regular Menu", "Regular Menu", "Regu…
## $ menu_items <chr> "McVeggie™ Burger", "McAloo Tikki Burger®", "McSpicy™ Paneer Burger",…
## $ per_serve_size <chr> "168 g", "146 g", "199 g", "250 g", "177 g", "306 g", "132 g", "87 g"…
## $ energy_k_cal <dbl> 402.05, 339.52, 652.76, 674.68, 512.17, 832.67, 356.09, 228.21, 400.8…
## $ protein_g <dbl> 10.24, 8.50, 20.29, 20.96, 15.30, 24.17, 7.91, 5.45, 15.66, 15.44, 21…
## $ total_fat_g <dbl> 13.83, 11.31, 39.45, 39.10, 23.45, 37.94, 15.08, 11.44, 15.70, 14.16,…
## $ sat_fat_g <dbl> 5.34, 4.27, 17.12, 19.73, 10.51, 16.83, 6.11, 5.72, 5.47, 5.79, 7.63,…
## $ trans_fat_g <dbl> 0.16, 0.20, 0.18, 0.26, 0.17, 0.28, 0.24, 0.09, 0.16, 0.21, 0.18, 0.2…
## $ cholesterols_mg <dbl> 2.49, 1.47, 21.85, 40.93, 25.24, 36.19, 9.45, 5.17, 31.17, 32.83, 66.…
## $ total_carbohydrate_g <dbl> 56.54, 50.27, 52.33, 59.27, 56.96, 93.84, 46.36, 24.79, 47.98, 38.85,…
## $ total_sugars_g <dbl> 7.90, 7.05, 8.35, 3.50, 7.85, 11.52, 4.53, 2.73, 5.53, 5.58, 5.88, 2.…
## $ added_sugars_g <dbl> 4.49, 4.07, 5.27, 1.08, 4.76, 6.92, 1.15, 0.35, 4.49, 3.54, 4.49, 1.0…
## $ sodium_mg <dbl> 706.13, 545.34, 1074.58, 1087.46, 1051.24, 1529.22, 579.60, 390.74, 7…
I love to take the first peek into a dataset with the amazing {skimr}
package. A few takeaways:

- per_serve_size doesn’t sound like it should be of type character
- sodium_mg has a missing value

skimr::skim(dat)

| Name | dat |
| Number of rows | 141 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| menu_category | 0 | 1 | 11 | 15 | 0 | 7 | 0 |
| menu_items | 0 | 1 | 8 | 42 | 0 | 141 | 0 |
| per_serve_size | 0 | 1 | 3 | 9 | 0 | 107 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| energy_k_cal | 0 | 1.00 | 244.64 | 185.55 | 0 | 116.36 | 219.36 | 339.52 | 834.36 | ▇▇▃▁▁ |
| protein_g | 0 | 1.00 | 7.49 | 8.34 | 0 | 0.65 | 4.79 | 10.88 | 39.47 | ▇▃▁▁▁ |
| total_fat_g | 0 | 1.00 | 9.99 | 10.34 | 0 | 0.46 | 7.77 | 14.16 | 45.18 | ▇▅▂▁▁ |
| sat_fat_g | 0 | 1.00 | 5.00 | 4.90 | 0 | 0.28 | 4.27 | 7.28 | 20.46 | ▇▅▂▁▁ |
| trans_fat_g | 0 | 1.00 | 0.69 | 6.33 | 0 | 0.06 | 0.15 | 0.22 | 75.26 | ▇▁▁▁▁ |
| cholesterols_mg | 0 | 1.00 | 26.35 | 50.33 | 0 | 1.51 | 8.39 | 31.11 | 302.61 | ▇▁▁▁▁ |
| total_carbohydrate_g | 0 | 1.00 | 31.19 | 20.60 | 0 | 15.74 | 30.82 | 46.00 | 93.84 | ▇▇▆▂▁ |
| total_sugars_g | 0 | 1.00 | 15.46 | 15.69 | 0 | 2.33 | 9.16 | 26.95 | 64.22 | ▇▂▂▁▁ |
| added_sugars_g | 0 | 1.00 | 10.34 | 14.28 | 0 | 0.00 | 3.64 | 19.23 | 64.22 | ▇▂▁▁▁ |
| sodium_mg | 1 | 0.99 | 362.06 | 473.16 | 0 | 43.90 | 152.02 | 534.24 | 2399.49 | ▇▂▁▁▁ |
My favorite way of exploring missing data is to make it visible,
using Nick Tierney’s
amazing {naniar}
package.
We can see the missing sodium_mg rows here.
dat %>%
vis_miss()

Since this is an open-ended exploration, I will posit some questions which will guide the flow of further work.
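For a quick numeric companion to the missingness plot above, per-column NA counts work too (base R, a sketch):

```r
# Count missing values per column; only sodium_mg should be non-zero
colSums(is.na(dat))
```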
To aid answering many of these, I first need to create a few new features in the data set.
We go from 13 columns to 15 columns in the data set.
The serving size by default is a character, with the units (g)
embedded in the column. Let’s clean this up, and rename the column to
our convention {measurement}_{unit}.
dat <- dat %>%
mutate(per_serve_size = as.numeric(stringr::str_extract(per_serve_size, "\\d*"))) %>%
rename(per_serve_size_g = per_serve_size)
glimpse(dat, 80)

## Rows: 141
## Columns: 13
## $ menu_category <chr> "Regular Menu", "Regular Menu", "Regular Menu", "…
## $ menu_items <chr> "McVeggie™ Burger", "McAloo Tikki Burger®", "McSp…
## $ per_serve_size_g <dbl> 168, 146, 199, 250, 177, 306, 132, 87, 173, 136, …
## $ energy_k_cal <dbl> 402.05, 339.52, 652.76, 674.68, 512.17, 832.67, 3…
## $ protein_g <dbl> 10.24, 8.50, 20.29, 20.96, 15.30, 24.17, 7.91, 5.…
## $ total_fat_g <dbl> 13.83, 11.31, 39.45, 39.10, 23.45, 37.94, 15.08, …
## $ sat_fat_g <dbl> 5.34, 4.27, 17.12, 19.73, 10.51, 16.83, 6.11, 5.7…
## $ trans_fat_g <dbl> 0.16, 0.20, 0.18, 0.26, 0.17, 0.28, 0.24, 0.09, 0…
## $ cholesterols_mg <dbl> 2.49, 1.47, 21.85, 40.93, 25.24, 36.19, 9.45, 5.1…
## $ total_carbohydrate_g <dbl> 56.54, 50.27, 52.33, 59.27, 56.96, 93.84, 46.36, …
## $ total_sugars_g <dbl> 7.90, 7.05, 8.35, 3.50, 7.85, 11.52, 4.53, 2.73, …
## $ added_sugars_g <dbl> 4.49, 4.07, 5.27, 1.08, 4.76, 6.92, 1.15, 0.35, 4…
## $ sodium_mg <dbl> 706.13, 545.34, 1074.58, 1087.46, 1051.24, 1529.2…
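One caveat worth flagging: the digits-only extraction above would truncate any decimal serve sizes (a hypothetical "167.5 g" would become 167). readr::parse_number() is a more forgiving alternative:

```r
# parse_number() drops the unit and keeps decimals
# ("167.5 g" is a made-up value for illustration)
readr::parse_number(c("168 g", "146 g", "167.5 g"))

## [1] 168.0 146.0 167.5
```

It would also be worth checking whether any serve sizes are recorded in ml rather than g before trusting the `_g` suffix for every row.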
I want to explore some of the data using calories as a covariate. Here’s the distribution of calories:
ggplot(dat, aes(x = energy_k_cal)) +
geom_histogram(bins = 50) +
labs(
x = "Energy (kCal)",
y = "Counts"
)

I’ll use this distribution as a guide to create some bins for the energy column.
dat <- dat %>%
mutate(
cal_cat = case_when(
energy_k_cal <= 100 ~ "Low",
energy_k_cal > 100 & energy_k_cal <= 400 ~ "Medium",
energy_k_cal > 400 & energy_k_cal <= 600 ~ "High",
energy_k_cal > 600 ~ "Very High"
),
cal_cat = factor(cal_cat, levels = c("Very High", "High", "Medium", "Low"), ordered = TRUE)
)
glimpse(dat, 80)

## Rows: 141
## Columns: 14
## $ menu_category <chr> "Regular Menu", "Regular Menu", "Regular Menu", "…
## $ menu_items <chr> "McVeggie™ Burger", "McAloo Tikki Burger®", "McSp…
## $ per_serve_size_g <dbl> 168, 146, 199, 250, 177, 306, 132, 87, 173, 136, …
## $ energy_k_cal <dbl> 402.05, 339.52, 652.76, 674.68, 512.17, 832.67, 3…
## $ protein_g <dbl> 10.24, 8.50, 20.29, 20.96, 15.30, 24.17, 7.91, 5.…
## $ total_fat_g <dbl> 13.83, 11.31, 39.45, 39.10, 23.45, 37.94, 15.08, …
## $ sat_fat_g <dbl> 5.34, 4.27, 17.12, 19.73, 10.51, 16.83, 6.11, 5.7…
## $ trans_fat_g <dbl> 0.16, 0.20, 0.18, 0.26, 0.17, 0.28, 0.24, 0.09, 0…
## $ cholesterols_mg <dbl> 2.49, 1.47, 21.85, 40.93, 25.24, 36.19, 9.45, 5.1…
## $ total_carbohydrate_g <dbl> 56.54, 50.27, 52.33, 59.27, 56.96, 93.84, 46.36, …
## $ total_sugars_g <dbl> 7.90, 7.05, 8.35, 3.50, 7.85, 11.52, 4.53, 2.73, …
## $ added_sugars_g <dbl> 4.49, 4.07, 5.27, 1.08, 4.76, 6.92, 1.15, 0.35, 4…
## $ sodium_mg <dbl> 706.13, 545.34, 1074.58, 1087.46, 1051.24, 1529.2…
## $ cal_cat <ord> High, Medium, Very High, Very High, High, Very Hi…
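The same binning can also be written with base R’s cut(), which keeps the breaks and labels in one place (a sketch equivalent to the case_when above; note the levels come out in ascending order here, whereas the plotting code wants them reversed):

```r
# (-Inf,100] = Low, (100,400] = Medium, (400,600] = High, (600,Inf) = Very High
cal_cat2 <- cut(
  dat$energy_k_cal,
  breaks = c(-Inf, 100, 400, 600, Inf),
  labels = c("Low", "Medium", "High", "Very High"),
  ordered_result = TRUE
)
```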
Finally, I’ll add a feature called energy density, which
is the number of calories per weight of food. I suspect things like
cookies and cake (perhaps even the fizzy drinks?) would rank high on
this feature.
dat <- dat %>%
mutate(
energy_density = energy_k_cal / per_serve_size_g
)
glimpse(dat, 80)

## Rows: 141
## Columns: 15
## $ menu_category <chr> "Regular Menu", "Regular Menu", "Regular Menu", "…
## $ menu_items <chr> "McVeggie™ Burger", "McAloo Tikki Burger®", "McSp…
## $ per_serve_size_g <dbl> 168, 146, 199, 250, 177, 306, 132, 87, 173, 136, …
## $ energy_k_cal <dbl> 402.05, 339.52, 652.76, 674.68, 512.17, 832.67, 3…
## $ protein_g <dbl> 10.24, 8.50, 20.29, 20.96, 15.30, 24.17, 7.91, 5.…
## $ total_fat_g <dbl> 13.83, 11.31, 39.45, 39.10, 23.45, 37.94, 15.08, …
## $ sat_fat_g <dbl> 5.34, 4.27, 17.12, 19.73, 10.51, 16.83, 6.11, 5.7…
## $ trans_fat_g <dbl> 0.16, 0.20, 0.18, 0.26, 0.17, 0.28, 0.24, 0.09, 0…
## $ cholesterols_mg <dbl> 2.49, 1.47, 21.85, 40.93, 25.24, 36.19, 9.45, 5.1…
## $ total_carbohydrate_g <dbl> 56.54, 50.27, 52.33, 59.27, 56.96, 93.84, 46.36, …
## $ total_sugars_g <dbl> 7.90, 7.05, 8.35, 3.50, 7.85, 11.52, 4.53, 2.73, …
## $ added_sugars_g <dbl> 4.49, 4.07, 5.27, 1.08, 4.76, 6.92, 1.15, 0.35, 4…
## $ sodium_mg <dbl> 706.13, 545.34, 1074.58, 1087.46, 1051.24, 1529.2…
## $ cal_cat <ord> High, Medium, Very High, Very High, High, Very Hi…
## $ energy_density <dbl> 2.393155, 2.325479, 3.280201, 2.698720, 2.893616,…
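As a sanity check on the new column: with the standard Atwater factors (roughly 9 kCal/g for fat, 4 kCal/g for carbohydrate and protein), energy density should top out below about 9 kCal/g; anything higher would suggest a data-entry problem.

```r
# Energy density should fall roughly in [0, 9] kCal/g
range(dat$energy_density, na.rm = TRUE)
```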
Now that the data set is prepared, it’s time for the fun part: getting creative with some interesting visuals!
There are four components to the EDA below:
# Custom plotting functions
make_dot_violin_plot <- function(dat,
title = NULL,
xlab = NULL,
ylab = NULL,
subtitle = NULL,
legend_title = NULL,
n_outliers = 1,
annot_xnudge = 0.1,
annot_xoffset = 0.5,
annot_ynudge = 10,
annot_curv = 0.2,
label_size = 6,
he_width = 0.5,
he_alpha = 0.7,
pt_size = 4,
pt_width = 0.2,
pt_alpha = 0.5){
pdat <- dat %>%
mutate(colorby = forcats::fct_reorder(str_wrap(colorby, 10), -y))
outlier <- pdat %>%
arrange(-y) %>%
slice(n_outliers)
g <- pdat %>%
ggplot(aes(x = forcats::fct_reorder(colorby, -y),
y = y)) +
ggdist::stat_halfeye(
aes(thickness = after_stat(f * n)),
width = he_width,
position = position_nudge(x = .2),
alpha = he_alpha
) +
geom_quasirandom(
aes(color = cat),
size = pt_size,
width = pt_width,
alpha = pt_alpha,
varwidth = TRUE
) +
ggtitle(title) +
labs(
x = xlab,
y = ylab,
subtitle = subtitle,
color = legend_title
) +
theme(legend.position = "bottom")
if (n_outliers > 0) {
g <- g +
annotate(
"curve",
x = as.numeric(outlier$colorby) + annot_xoffset,
xend = as.numeric(outlier$colorby) + annot_xnudge,
y = outlier$y + annot_ynudge,
yend = outlier$y,
curvature = annot_curv,
arrow = arrow(length = unit(10, "pt"),
type = "closed")
) +
annotate(
"text",
x = as.numeric(outlier$colorby) + annot_xoffset,
y = outlier$y + annot_ynudge,
label = outlier$menu_items,
size = label_size,
hjust = -0.1
)
}
g
}

Aha! As I suspected, the Muffins are the worst offenders for calories per gram of food. However, look at the Condiments! Those little ketchup and mustard packets are super energy dense!
dat %>%
select(y = energy_density, cat = cal_cat, colorby = menu_category, menu_items) %>%
make_dot_violin_plot(
he_width = 1.5,
title = "How does energy density vary across the menu?",
ylab = "Energy density (kCal/g)",
subtitle = "Each point represents the energy density of 1 serving of the menu item",
legend_title = "Calorie category",
n_outliers = 1,
annot_ynudge = 0.15,
annot_curv = -0.1
)

Q: How does the Gourmet Menu differ from the Regular Menu?
Across the board, the Gourmet Menu items have higher values: except for energy density, every other metric’s distribution is right-shifted. This is expected, as Gourmet items are essentially regular items with extra oomph!
dat %>%
filter(menu_category %in% c("Regular Menu", "Gourmet Menu")) %>%
filter(trans_fat_g < 20) %>%
select(menu_category, where(is.numeric)) %>%
tidyr::pivot_longer(-menu_category) %>%
ggplot(aes(x = value,
fill = menu_category)) +
geom_density(alpha = 0.8) +
facet_wrap(~ name,
scales = "free"
) +
ggtitle(
label = "Comparison of the Gourmet & Regular Menu Options"
) +
labs(
x = NULL,
y = "Probability density",
fill = NULL,
caption = "Outlier '5 piece Chicken Strips' with `trans_fat` of 75.3g removed"
) +
theme(legend.position = "top",
plot.subtitle = element_text(face = "italic"))

One of my favorite methods of looking at multivariate numerical data is the simple, yet powerful tool - PCA. It’s a great way to extract layers upon layers of information about the dataset from just a few plots. Let’s explore this dataset using PCA.
dat %>%
select(where(is.numeric)) %>%
# Impute the missing sodium_mg values to the median
mutate(sodium_mg = ifelse(is.na(sodium_mg), median(sodium_mg, na.rm = TRUE), sodium_mg)) %>%
scale() %>%
as.data.frame() -> dat_scaled
pca <- PCA(dat_scaled, graph = FALSE)
pca

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 141 individuals, described by 12 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
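A small aside before the plots: FactoMineR::PCA() standardizes the variables itself by default (scale.unit = TRUE), so the manual scale() step above is belt-and-braces rather than required; imputing alone should give essentially the same result (a sketch):

```r
# PCA() centers and scales internally when scale.unit = TRUE (the default),
# so only the median imputation is strictly needed
dat_imputed <- dat %>%
  select(where(is.numeric)) %>%
  mutate(sodium_mg = tidyr::replace_na(sodium_mg,
                                       median(sodium_mg, na.rm = TRUE)))
pca_alt <- PCA(dat_imputed, scale.unit = TRUE, graph = FALSE)
```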
fviz_eig(pca, addlabels = TRUE, ylim = c(0, 50))

fviz_pca_var(pca, repel = TRUE, alpha.var = "contrib")

fviz_contrib(pca, choice = "var", axes = 1, top = 10)

fviz_contrib(pca, choice = "var", axes = 2, top = 10)

fviz_cos2(pca, choice = "var", axes = 1:2)

fviz_pca_ind(pca, col.ind = dat$menu_category, select.ind = list(cos2 = 0.7))

## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 16 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).

fviz_pca_biplot(pca, col.ind = dat$menu_category, select.ind = list(cos2 = 0.8))

That was a fun exploration of these data. What else can you think of to explore?
Cheers!